GEV-Canonical Regression for Accurate Binary Class Probability Estimation when One Class is Rare
Authors
Abstract
We consider the problem of binary class probability estimation (CPE) when one class is rare compared to the other. It is well known that standard algorithms such as logistic regression do not perform well in this setting, as they tend to underestimate the probability of the rare class. Common fixes include under-sampling and weighting, together with various correction schemes. Recently, Wang & Dey (2010) suggested the use of a parametrized family of asymmetric link functions based on the generalized extreme value (GEV) distribution, which has been used for modeling rare events in statistics. The approach showed promising initial results, but when combined with the logarithmic CPE loss implicitly used in their work, it yields a non-convex composite loss that is difficult to optimize. In this paper, we use tools from the theory of proper composite losses (Buja et al., 2005; Reid & Williamson, 2010) to construct a canonical underlying CPE loss corresponding to the GEV link. This yields a convex proper composite loss that we call the GEV-canonical loss; it can be tailored to CPE settings where one class is rare, and is easily minimized using an IRLS-type algorithm similar to that used for logistic regression. Our experiments on both synthetic and real data suggest that the resulting algorithm, which we term GEV-canonical regression, performs well compared to common approaches such as under-sampling and weights-correction for this problem.
Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).
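To make the abstract's central idea concrete, here is a minimal Python sketch of a GEV-based link and of the gradient that a canonical proper composite loss produces. It rests on several assumptions that the abstract does not spell out: the inverse link is taken to be the standard GEV cumulative distribution function p = exp(-(1 + xi*v)^(-1/xi)) with shape xi != 0, and the function names (`gev_inverse_link`, `gev_link`, `canonical_gradient`) are hypothetical. The one property relied on is the general fact about canonical links that the derivative of the loss with respect to the score v is p - y, the same form as in logistic regression; the paper's exact parametrization and its IRLS updates may differ.

```python
import numpy as np

def gev_inverse_link(v, xi):
    """Map a real-valued score v to a probability with the GEV CDF (assumed form, xi != 0)."""
    v = np.asarray(v, dtype=float)
    # Clip so scores outside the GEV support saturate at p ~ 0 (xi > 0) or p ~ 1 (xi < 0).
    t = np.maximum(1.0 + xi * v, 1e-12)
    return np.exp(-t ** (-1.0 / xi))

def gev_link(p, xi):
    """Inverse of gev_inverse_link: map a probability in (0, 1) back to a score."""
    p = np.asarray(p, dtype=float)
    return ((-np.log(p)) ** (-xi) - 1.0) / xi

def canonical_gradient(X, y, beta, xi):
    """Average gradient of a canonical composite loss for a linear model v = X @ beta.

    With a canonical link the per-example derivative w.r.t. the score is (p - y),
    so the gradient w.r.t. beta is X^T (p - y) / n.
    """
    p = gev_inverse_link(X @ beta, xi)
    return X.T @ (p - y) / len(y)

# The link is asymmetric: a score of 0 maps to exp(-1), not 0.5 as with the logistic link.
p_at_zero = gev_inverse_link(0.0, xi=0.2)
```

Because the inverse link is increasing in v, the gradient p - y is monotone in the score, which is the property that makes the resulting loss convex in beta. A Newton or IRLS-type step would additionally weight each example by the derivative of the inverse link, which this sketch leaves out.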
Similar resources
UCD GEARY INSTITUTE DISCUSSION PAPER SERIES Generalized Extreme Value Regression for Binary Rare Events Data: an Application to Credit Defaults
The most used regression model with binary dependent variable is the logistic regression model. When the dependent variable represents a rare event, the logistic regression model shows relevant drawbacks. In order to overcome these drawbacks we propose the Generalized Extreme Value (GEV) regression model. In particular, in a Generalized Linear Model (GLM) with binary dependent variable we sugge...
Accounting for secondary variable for the classification of mineral resources using co-kriging technique; a Case study of Sarcheshmeh porphyry copper deposit
Because the classification of resource models has a substantial effect on future mine planning, an accurate estimation method is needed to keep the estimation error to a minimum. The world-class Cu-Mo deposit, Sarcheshmeh Porphyry deposit (central Iran), was selected as the study area. The Hypogene zone of the deposit was chosen as the space in which estim...
The Probit Link Function in Generalized Linear Models for Data Mining Applications
The use of logistic regression for outcome classification of dichotomous variables is well known in data mining applications. The estimated probability of the logit transformation belongs to the class of canonical link functions that follow from particular probability distribution functions. A closely related model is the probit link which can be used for binary responses. Although the probit l...
Estimating Binary Spatial Autoregressive Models for Rare Events
This paper proposes a new statistical estimator, to be applied to the prediction of state failures. State failures are typically conceptualised in a binary fashion—a state fails or it does not—and are rare events. Furthermore, state failures are not geographically independent events. The failure of one state can be expected to have an impact on the stability and peace in neighboring states, inc...
Multicategory large-margin unified machines
Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predic...